Fix search_graph name_pattern= performance: regex cache, LIKE pre-filter, cheap count #300

Merged
DeusData merged 3 commits into DeusData:main from arbor-education:fix/254-search-graph-name-pattern-performance on May 10, 2026

@awconstable
Contributor

Fixes #254

Root cause

Three compounding bugs caused name_pattern= searches to scan every node with an expensive compiled regex, regardless of how selective the pattern was:

  1. sqlite_iregexp / sqlite_regexp recompiled the regex on every row — cbm_regcomp + cbm_regfree fired once per node for the full table.
  2. The count query wrapped the full SELECT (including two correlated edge-count subqueries per row) in SELECT COUNT(*) FROM (...), doubling the scan with identical per-row overhead.
  3. cbm_extract_like_hints was implemented and correct but never called — the LIKE pre-filter that should cut the regex scan to only matching rows was dead code.

Changes

Fix 1 — regex cached per statement (sqlite_regexp / sqlite_iregexp)
Use sqlite3_get_auxdata / sqlite3_set_auxdata to cache the compiled cbm_regex_t for the lifetime of the statement. cbm_regcomp is now called exactly once per query, not once per row.
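
The compile-once idea behind Fix 1 can be sketched without SQLite at all. The block below is a standalone illustration, not the PR's code: it uses POSIX `regcomp` in place of `cbm_regcomp`, and a plain struct (`stmt_regex_cache_t`, a hypothetical name) stands in for the per-statement slot that `sqlite3_get_auxdata` / `sqlite3_set_auxdata` manage. It assumes the pattern is constant for the whole statement, which is also the condition under which SQLite keeps auxdata alive.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-statement auxdata slot. */
typedef struct {
    regex_t re;
    int compiled;      /* 1 once `re` holds a compiled pattern */
    int compile_calls; /* instrumentation: how often regcomp ran */
} stmt_regex_cache_t;

/* Returns 1 on match, 0 on no match, -1 if the pattern does not compile.
 * The pattern is compiled on the first call only; later calls reuse it. */
static int cached_iregexp(stmt_regex_cache_t *cache, const char *pattern,
                          const char *text) {
    assert(cache != NULL && pattern != NULL && text != NULL);
    if (!cache->compiled) {
        cache->compile_calls++;
        if (regcomp(&cache->re, pattern,
                    REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
            return -1;
        cache->compiled = 1;
    }
    return regexec(&cache->re, text, 0, NULL, 0) == 0 ? 1 : 0;
}

/* Plays the role of the destructor that sqlite3_set_auxdata would be given. */
static void stmt_regex_cache_free(stmt_regex_cache_t *cache) {
    if (cache->compiled) {
        regfree(&cache->re);
        cache->compiled = 0;
    }
}

/* Simulates one statement scanning n rows; returns the number of matches. */
static int scan_rows(stmt_regex_cache_t *cache, const char *pattern,
                     const char *const *names, size_t n) {
    int hits = 0;
    for (size_t i = 0; i < n; i++)
        if (cached_iregexp(cache, pattern, names[i]) == 1)
            hits++;
    return hits;
}
```

Scanning any number of rows with one cache leaves `compile_calls` at 1, which is the property the fix restores. In a real SQLite function the cache would be fetched with `sqlite3_get_auxdata(ctx, 0)` and stored with `sqlite3_set_auxdata(ctx, 0, ptr, destructor)` instead of being passed in by the caller.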

Fix 2 — LIKE pre-filter wired in (where_add_like_hints, search_where_basic)
Wire cbm_extract_like_hints into search_where_basic via a new where_add_like_hints helper. For .*Controller.* this prepends n.name LIKE '%Controller%'; the idx_nodes_name index satisfies the LIKE clause and only matching rows reach iregexp(). Added search_like_pool_t to manage the malloc'd LIKE strings across both statement executions. ST_SEARCH_MAX_BINDS raised 16 → 32.
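
To make the pre-filter idea concrete, here is a minimal sketch of turning a simple pattern into a LIKE hint. It is not `cbm_extract_like_hints`, whose signature and coverage are not shown in this PR; `like_hint_from_pattern` is a hypothetical helper that only handles a plain literal with an optional leading and trailing `.*`, and gives up on anything else so the regex alone decides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes "%literal%" into `out` and returns 1 when the
 * pattern is a plain literal optionally wrapped in ".*"; returns 0 (no hint)
 * for any pattern that still contains regex metacharacters. */
static int like_hint_from_pattern(const char *pattern, char *out, size_t cap) {
    static const char meta[] = ".^$*+?()[]{}|\\";
    const char *lit;
    size_t len;

    assert(pattern != NULL && out != NULL);
    lit = pattern;
    len = strlen(pattern);

    /* Strip one leading and one trailing ".*" if present. */
    if (len >= 2 && lit[0] == '.' && lit[1] == '*') { lit += 2; len -= 2; }
    if (len >= 2 && lit[len - 2] == '.' && lit[len - 1] == '*') { len -= 2; }

    /* Need room for '%' + literal + '%' + NUL. */
    if (len == 0 || len + 3 > cap) return 0;

    /* Any remaining metacharacter means the literal is not guaranteed to
     * appear in every match, so emitting a hint would be unsafe. */
    for (size_t i = 0; i < len; i++)
        if (strchr(meta, lit[i]) != NULL) return 0;

    out[0] = '%';
    memcpy(out + 1, lit, len);
    out[len + 1] = '%';
    out[len + 2] = '\0';
    return 1;
}
```

The hint only has to be a superset of the regex matches: every row the regex would accept must also pass the LIKE. An unanchored literal such as `specificFunctionName` therefore yields the same `%...%` shape as `.*Controller.*`, and a `_` inside the literal merely makes the LIKE slightly less selective, never wrong.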

Fix 3 — count query stripped of per-row edge subqueries
For the common no-degree-filter path, the count SQL is now SELECT COUNT(*) FROM nodes n WHERE <same WHERE> — no correlated edges subqueries. The degree-filter path retains the wrapped form since it needs those columns for the filter.
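
The split between the two count paths could look roughly like the sketch below. `build_count_sql` is a hypothetical function, and the `edges` column names (`source_id`, `target_id`) and the `iregexp(pattern, text)` argument order are assumptions made for illustration; only the overall shape of the two statements comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical builder. `where` is the shared WHERE clause; `degree_filter`
 * is NULL on the common path or a predicate over in_deg / out_deg.
 * Returns 0 on success, -1 on empty input or if `out` is too small. */
static int build_count_sql(char *out, size_t cap, const char *where,
                           const char *degree_filter) {
    int n;

    assert(out != NULL);
    if (where == NULL || strlen(where) == 0) return -1;

    if (degree_filter == NULL) {
        /* Cheap path: no per-row edge subqueries. */
        n = snprintf(out, cap, "SELECT COUNT(*) FROM nodes n WHERE %s", where);
    } else {
        /* Degree-filter path: the wrapped form is kept because the outer
         * predicate needs the in_deg / out_deg columns. */
        n = snprintf(out, cap,
                     "SELECT COUNT(*) FROM ("
                     "SELECT n.id, "
                     "(SELECT COUNT(*) FROM edges e WHERE e.target_id = n.id) AS in_deg, "
                     "(SELECT COUNT(*) FROM edges e WHERE e.source_id = n.id) AS out_deg "
                     "FROM nodes n WHERE %s) WHERE %s",
                     where, degree_filter);
    }
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

Because both paths share the same `where` string, the count stays consistent with the main query while the common path avoids two correlated subqueries per matching row.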

Benchmark

Tested on a large PHP codebase (~200K nodes):

| Query | Before | After | Speedup |
|---|---|---|---|
| name_pattern=.*Controller.* | 3099ms | 508ms | 6× |
| name_pattern=.*Service.* | 2006ms | 506ms | 4× |
| name_pattern=.*Repository.* | 2006ms | 508ms | 4× |
| name_pattern=specificFunctionName | 1506ms | 507ms | 3× |
| label=Method + name_pattern=.*get.* | 8509ms | 509ms | 17× |

The ~500ms floor is cold-start I/O when spawning a fresh process against a ~500MB database. In the long-running MCP server (warm file cache) the query time is sub-millisecond.

A reusable benchmark script is included at scripts/benchmark-search-graph.sh.

Tests

All store search tests pass including store_search_pagination (offset-past-end total count), store_search_degree_filter, and the full store_extract_like_hints suite.

awconstable and others added 3 commits April 30, 2026 06:38
…ter, cheap count

Three compounding bugs caused 1.5–8.5s latency on name_pattern= searches against
large projects (216K nodes), now reduced to ~0ms query time (cold-start dominates):

Fix 1 — regex compiled once per statement, not once per row
  sqlite_regexp / sqlite_iregexp now use sqlite3_get_auxdata / sqlite3_set_auxdata
  to cache the compiled cbm_regex_t for the lifetime of the statement. Previously
  cbm_regcomp + cbm_regfree ran for every row scanned.

Fix 2 — LIKE pre-filter cuts rows reaching the regex
  Wire cbm_extract_like_hints (already implemented but dead) into search_where_basic
  via a new where_add_like_hints helper. For .*Controller.* this prepends
  n.name LIKE '%Controller%', letting the idx_nodes_name index satisfy the LIKE
  clause first and passing only matching rows to iregexp(). Added search_like_pool_t
  to manage the malloc'd LIKE strings across both statement executions.
  ST_SEARCH_MAX_BINDS raised 16 → 32 to accommodate extra bind slots.

Fix 3 — count query no longer runs per-row edge subqueries
  The count SQL previously wrapped the full SELECT (which includes two correlated
  subqueries for in_deg / out_deg) in SELECT COUNT(*) FROM (...), executing those
  edge counts for every matching row even though the count needs none of that.
  Non-degree-filter path now uses SELECT COUNT(*) FROM nodes n WHERE <same WHERE>,
  which has no per-row subqueries. Degree-filter path retains the wrapped form
  since it needs those columns for the filter.

Benchmark on home-ubuntu-dev-sis (216K nodes, 509MB DB):

  Query                                BEFORE    AFTER   speedup
  name_pattern=.*Controller.*          3099ms    508ms     6×
  name_pattern=.*Service.*             2006ms    506ms     4×
  name_pattern=.*Repository.*          2006ms    508ms     4×
  name_pattern=specificFuncName        1506ms    507ms     3×
  label=Method + name_pattern=.*get.*  8509ms    509ms    17×
  name_pattern=.*Approve.*             1506ms    507ms     3×
  name_pattern=.*authorize.*           1506ms    509ms     3×

The ~500ms floor is cold-start I/O (opening a 509MB file from disk). In the
long-running MCP server process the warm-cache query time is sub-millisecond.

All store search tests pass including pagination, degree filter, and extract_like_hints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Make project a required CLI argument instead of a hardcoded name,
and remove internal query strings used during development testing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Flat BM25 queries of the form:
  SELECT ... FROM nodes_fts JOIN nodes WHERE MATCH ? AND project=? ORDER BY bm25() LIMIT N
block FTS5 WAND/MaxScore early-exit — the outer JOIN+WHERE is invisible to
the FTS5 planner, so it scores every matching document before any filter fires.
On a large codebase with 100K+ matches this causes 2–16 minute queries.

Fix: two-step subquery.  The inner FTS5-only query:
  SELECT rowid, bm25(nodes_fts) FROM nodes_fts WHERE MATCH ? ORDER BY bm25() LIMIT 2000
can early-terminate because no outer predicate blocks it.  The outer query
then joins and filters at most BM25_INNER_LIMIT (2000) candidates.

The count query uses the identical inner-limit subquery, so it benefits too.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@DeusData added the bug and stability/performance labels on May 4, 2026
@DeusData merged commit 5f19454 into DeusData:main on May 10, 2026
@DeusData
Owner

Merged via rebase, thanks @awconstable — diagnoses are spot-on, fixes are clean, benchmarks reproduce. The auxdata caching is the canonical SQLite pattern, the LIKE pre-filter wiring is well-scoped (search_like_pool_t correctly handles the SQLITE_STATIC bind lifetime across both count and main statements), and the count-query split between the no-degree-filter and degree-filter paths is exactly right.

A note for anyone reading this thread later: the branch also contained the FTS5 two-step subquery fix that #302 was targeting, so #302 is now superseded — closing it as resolved.

Soft behavior note worth flagging on the FTS5 path: BM25_INNER_LIMIT = 2000 caps the inner candidate set, so the total reported to callers is now bounded by 2000 (or fewer after the outer filter). Pagination beyond offset 2000 therefore silently saturates and returns no rows. That is practically fine for ranked search (page 100 of search results was never going to be useful), but if anyone hits it later, that constant is the one to raise.
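
The saturation described in this note is simple arithmetic, sketched below under the assumption that the reported total is at most the capped inner candidate count. The helper names `inner_candidates` and `page_rows` are hypothetical; only the constant and its value come from the comment.

```c
#include <assert.h>

#define BM25_INNER_LIMIT 2000 /* value quoted in the comment above */

/* Number of candidates the inner FTS5 query can hand to the outer query. */
static long inner_candidates(long fts_matches) {
    assert(fts_matches >= 0);
    return fts_matches < BM25_INNER_LIMIT ? fts_matches : BM25_INNER_LIMIT;
}

/* Rows a page would contain, given the (already capped) total. */
static long page_rows(long total, long offset, long limit) {
    long remaining;

    assert(total >= 0 && offset >= 0 && limit >= 0);
    if (offset >= total) return 0; /* past the cap: empty page */
    remaining = total - offset;
    return remaining < limit ? remaining : limit;
}
```

With 150,000 FTS5 matches the candidate set is 2000, so a page of 50 starting at offset 1980 holds 20 rows and any page starting at offset 2000 or later is empty.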

Development

Successfully merging this pull request may close these issues.

search_graph on large datasets with name_pattern= is slow
